Question sets

Question 1

You suspect a casino coin is unfair. Let \(p\) be the probability of the coin landing on heads (1).

1.1

Problem: After five trials, you observe the sequence \([1, 1, 0, 0, 0]\). Please derive the Maximum Likelihood Estimate (MLE) for the probability of getting heads.

In a sequence of \(n\) independent Bernoulli trials, the likelihood function for \(p\) is: \[ L(p) = p^k (1-p)^{n-k} \]

where \(n=5\) and \(k=2\) (the number of heads).

To find the MLE, we maximize the log-likelihood \(l(p)\): \[ l(p) = 2 \ln(p) + 3 \ln(1-p) \]

Taking the first derivative with respect to \(p\) and setting it to zero: \[ \begin{aligned} \frac{dl}{dp} =& \frac{2}{p} - \frac{3}{1-p} = 0 \\ 2(1-p) =& 3p \implies 2 - 2p = 3p \implies 5p = 2 \\ \hat{p}_{MLE} =& \frac{2}{5} = 0.4 \end{aligned} \]
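As a quick numerical sanity check (a sketch, not part of the derivation; the grid search below is our own addition), maximizing \(l(p) = k\ln p + (n-k)\ln(1-p)\) over a grid of candidate values recovers \(k/n\):

```python
import numpy as np

# Observed sequence from the problem; k heads out of n flips.
flips = [1, 1, 0, 0, 0]
k, n = sum(flips), len(flips)

# Evaluate l(p) = k*ln(p) + (n-k)*ln(1-p) on a grid of candidate p values.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)
p_mle = p_grid[np.argmax(log_lik)]

print(round(p_mle, 3))  # 0.4, matching the analytic answer k/n
```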

1.2

This time we incorporate a prior: the probability of the coin landing on heads (1) follows a Beta distribution (i.e., \(p \sim B(2,8)\)). What is the maximum a posteriori (MAP) estimate of the probability of getting heads?

The Beta distribution is the conjugate prior for the Binomial likelihood. If the prior is \(\text{Beta}(\alpha, \beta)\) and we observe \(k\) successes in \(n\) trials, the posterior is: \[ p | \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k) \]

Plugging in the values (\(\alpha=2, \beta=8, k=2, n=5\)): \[ p | \text{data} \sim \text{Beta}(2 + 2, 8 + 3) = \text{Beta}(4, 11) \]

The MAP estimate is

\[ \hat{p}_{MAP} = \frac{4 - 1}{4 + 11 - 2} = \frac{3}{13} \approx 0.231 \]

Conclusion: The MAP estimate is \(\approx 0.231\), which shifts the MLE (\(0.4\)) toward the prior mean (\(0.2\)).

1.3

How do a weak and a stronger prior (say, \(p \sim B(20,80)\); note that the prior belief is still a 0.2 chance of heads) affect the MAP estimate?

Both \(\text{Beta}(2, 8)\) and \(\text{Beta}(20, 80)\) have the same mean (\(0.2\)), but the latter has a much smaller variance, representing a “stronger” or more certain belief.

  • Weak Prior \(\text{Beta}(2, 8)\): As calculated above, \(\hat{p}_{MAP} \approx 0.231\). The data has a significant impact on the estimate.
  • Strong Prior \(\text{Beta}(20, 80)\): The new posterior is \(\text{Beta}(20+2, 80+3) = \text{Beta}(22, 83)\). \[ \hat{p}_{MAP} = \frac{22 - 1}{22 + 83 - 2} = \frac{21}{103} \approx 0.204 \]

A stronger prior is more resistant to change from new data. Even though we observed \(40\%\) heads in our sample, the strong prior dominates the calculation, resulting in an estimate (\(0.204\)) much closer to the prior mean (\(0.20\)) than the weak prior estimate (\(0.231\)).
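Both estimates can be reproduced with the posterior-mode formula \(\hat{p}_{MAP} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}\) (the helper name below is our own):

```python
def beta_binomial_map(alpha, beta, k, n):
    """Mode of the Beta(alpha + k, beta + n - k) posterior."""
    return (alpha + k - 1) / (alpha + beta + n - 2)

k, n = 2, 5  # 2 heads in 5 flips, as observed above
weak = beta_binomial_map(2, 8, k, n)      # weak prior Beta(2, 8)
strong = beta_binomial_map(20, 80, k, n)  # strong prior Beta(20, 80)

print(round(weak, 3), round(strong, 3))  # 0.231 0.204
```

The strong-prior estimate sits much closer to the prior mean 0.2, as described above.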

Question 2

For a linear regression model (for simplicity we do not consider the intercept in this case), \(y=x\beta_1+\epsilon\), where \(\epsilon \sim N(0,\sigma^2)\). Implicitly, \(y \sim N(x\beta_1,\sigma^2)\).

2.1

Please show that the log-likelihood for \(N\) observations \((y_i,x_i)\) given \(\beta_1\) is

\[ l(\beta_1,\sigma^2,y,x) = -\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-x_i\beta_1)^2 \]

Since \(\epsilon_i \sim N(0, \sigma^2)\), the response variable follows the distribution \(y_i \sim N(x_i\beta_1, \sigma^2)\). The probability density function (PDF) for a single observation is: \[ f(y_i | x_i, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i\beta_1)^2}{2\sigma^2} \right) \]

Assuming the observations are independent and identically distributed (i.i.d.), the likelihood function \(L(\beta_1, \sigma^2)\) is the product of the individual densities: \[ \begin{aligned} L(\beta_1, \sigma^2) =& \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i\beta_1)^2}{2\sigma^2} \right)\\ L(\beta_1, \sigma^2) =& (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i\beta_1)^2 \right) \end{aligned} \]

Taking the natural logarithm to find the log-likelihood \(l = \ln(L)\):

\[ \begin{aligned} l(\beta_1, \sigma^2) =& \ln \left[ (2\pi\sigma^2)^{-N/2} \right] + \ln \left[ \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i\beta_1)^2 \right) \right] \\ l(\beta_1, \sigma^2) =& -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - x_i\beta_1)^2 \end{aligned} \]
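The derived expression can be checked against scipy's normal log-density on simulated data (a sketch; the sample size, \(\beta_1\), and \(\sigma^2\) values below are arbitrary choices for illustration, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
beta1, sigma2, N = 1.5, 0.64, 50
x = rng.normal(size=N)
y = x * beta1 + rng.normal(scale=np.sqrt(sigma2), size=N)

# Log-likelihood from the closed-form expression derived above.
ll_formula = -N / 2 * np.log(2 * np.pi * sigma2) \
             - np.sum((y - x * beta1) ** 2) / (2 * sigma2)
# Same quantity via the sum of individual normal log-densities.
ll_scipy = stats.norm.logpdf(y, loc=x * beta1, scale=np.sqrt(sigma2)).sum()

print(np.isclose(ll_formula, ll_scipy))  # True
```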

2.2

Following question 2.1, please show that the MLE for \(\beta_1\) is

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2} \]

To find the MLE \(\hat{\beta}_1\), we maximize the log-likelihood by taking the partial derivative with respect to \(\beta_1\) and setting it to zero:

  1. Differentiate: \[ \frac{\partial l}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i\beta_1)^2 \right] \]

Applying the chain rule

\[ \begin{aligned} \frac{\partial l}{\partial \beta_1} =& -\frac{1}{2\sigma^2} \sum_{i=1}^N 2(y_i - x_i\beta_1)(-x_i) \\ =& \frac{1}{\sigma^2} \sum_{i=1}^N (x_i y_i - x_i^2 \beta_1) \end{aligned} \]

  2. Set to Zero: \[ \frac{1}{\sigma^2} \left( \sum_{i=1}^N x_i y_i - \hat{\beta}_1 \sum_{i=1}^N x_i^2 \right) = 0 \]

  3. Solve for \(\hat{\beta}_1\):

\[ \begin{aligned} \sum_{i=1}^N x_i y_i =& \hat{\beta}_1 \sum_{i=1}^N x_i^2 \\ \hat{\beta}_1 =& \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^Nx_i^2} \end{aligned} \]
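A quick numeric check (the data are simulated here; the true slope 2.0 and noise level are made up): the closed-form estimator agrees with numpy's generic least-squares solver on a no-intercept fit:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

# Closed-form MLE derived above: sum(x_i * y_i) / sum(x_i^2).
beta_closed = np.sum(x * y) / np.sum(x ** 2)
# Same no-intercept fit via the generic least-squares solver.
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(np.isclose(beta_closed, beta_lstsq))  # True
```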

2.3

The null model for a linear model is an intercept-only model, that is, \(y \sim N(\beta_0, \sigma^2)\). Please show that the log-likelihood for \(N\) observations under the null model is

\[ l(\beta_0,\sigma^2,y) = -\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\beta_0)^2 \]

Replace \(x_i\beta_1\) with \(\beta_0\) in the derivation of 2.1: each \(y_i \sim N(\beta_0, \sigma^2)\), and the same product-of-densities argument yields the stated log-likelihood.

2.4

Please show that the Likelihood Ratio Test (LRT) of the null hypothesis \(H_0: \beta_1=0\) and alternative hypothesis \(H_1: \beta_1\neq 0\) is

\[ \frac{1}{\sigma^2}\left(\sum_{i=1}^N(y_i-\beta_0)^2-\sum_{i=1}^N(y_i-x_i\beta_1)^2\right) \]

The LRT statistic is \(-2 \ln(\frac{L(H_0)}{L(H_1)}) = -2[l(H_0) - l(H_1)]\)

Substituting \(l(H_0)\) from 2.3 and \(l(H_1)\) from 2.1, the \(-\frac{N}{2}\ln(2\pi\sigma^2)\) terms cancel: \[ -2[l(H_0) - l(H_1)] = -2\left[-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\beta_0)^2 + \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-x_i\beta_1)^2\right] = \frac{1}{\sigma^2}\left(\sum_{i=1}^N(y_i-\beta_0)^2 - \sum_{i=1}^N(y_i-x_i\beta_1)^2\right) \]

2.5

In the machine learning perspective, we usually ask the model to minimize the mean squared error (MSE, \(\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2\), where \(\hat{y}_i\) is the predicted value, \(x_i\hat{\beta}_1\) in our case). Please describe why minimizing the MSE is equivalent to finding the MLE.

In the MLE calculation, the only term of the log-likelihood involving \(\beta_1\) is \(-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-x_i\beta_1)^2\), so maximizing the log-likelihood is equivalent to minimizing \(\sum_{i=1}^N(y_i-x_i\beta_1)^2\), which is exactly \(N\) times the MSE.
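The equivalence can also be seen numerically (a sketch; the data below are simulated with an arbitrary slope): the \(\beta_1\) that minimizes the MSE coincides with the closed-form MLE from 2.2:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = -0.7 * x + rng.normal(scale=0.3, size=200)

# Minimize the MSE directly as a function of beta1...
mse = lambda b: np.mean((y - x * b) ** 2)
beta_mse = minimize_scalar(mse).x
# ...and compare with the closed-form MLE.
beta_mle = np.sum(x * y) / np.sum(x ** 2)

print(abs(beta_mse - beta_mle) < 1e-6)  # True
```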

2.6

In typical linear regression, \(R^2\) or adjusted \(R^2\) is usually the more common measure of how good a model is. Please describe the advantages of the likelihood ratio test over \(R^2\).

Advantages include:

  1. Model comparison: the null model can be swapped for any other nested subset of predictors, so the LRT supports model selection directly
  2. A more general framework: the LRT generalizes to logistic, Poisson, and negative binomial regression as well

Question 3

A scientist is studying the effect of a new drug on the expression level of a specific gene. They have measured the expression levels in two small groups of mice: a Control group (\(n=3\)) and a Treatment group (\(m=3\)). The expression levels of these mice are:

  • Control (C): \(\{10, 12, 14\}\)
  • Treatment (T): \(\{18, 20, 22\}\)

We want to test whether the drug significantly increases the mean gene expression using a Permutation Test.

3.1

What are the null and alternative hypotheses of the test?

Null Hypothesis (\(H_0\)): \(\mu_C = \mu_T\)

There is no difference in the distribution of gene expression between the Control and Treatment groups.

Alternative Hypothesis (\(H_1\)): \(\mu_T > \mu_C\)

This is a one-tailed test.

3.2

What is the test statistic \(\Delta_{obs}\)?

\[ \begin{aligned} \Delta =& \bar{X}_T - \bar{X}_C \\ \Delta_{obs} =& 20 - 12 = 8 \end{aligned} \]

3.3

Please list at least 3 possible permutations and their corresponding test statistics (\(\Delta_{perm}\))

  • Permutation A (The observed case):

\[ T = \{18, 20, 22\}, C = \{10, 12, 14\} \implies \Delta = 8 \]

  • Permutation B (A “mixed” case):

\[ T = \{10, 18, 22\}, C = \{12, 14, 20\} \implies \bar{X}_T = 16.67, \bar{X}_C = 15.33 \implies \Delta = 1.33 \]

  • Permutation C (The “inverse” case): \[ T = \{10, 12, 14\}, C = \{18, 20, 22\} \implies \bar{X}_T = 12, \bar{X}_C = 20 \implies \Delta = -8 \]

3.4

If, after enumerating all \(\binom{6}{3} = 20\) possible permutations, only 1 (the original data) results in a difference \(\ge \Delta_{obs}\), what is the p-value?

The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the observed statistic, assuming \(H_0\) is true. \[ p = \frac{\text{Number of permutations where } \Delta \ge \Delta_{obs}}{\text{Total number of permutations}} = \frac{1}{20} = 0.05 \]
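The whole test can be verified by brute force (a sketch; the variable names are our own), enumerating every way to assign 3 of the 6 values to the treatment group:

```python
from itertools import combinations

data = [10, 12, 14, 18, 20, 22]
obs_delta = (18 + 20 + 22) / 3 - (10 + 12 + 14) / 3  # 8.0

# Every choice of 3 treatment values defines one permutation/split.
deltas = []
for treat in combinations(data, 3):
    control = [v for v in data if v not in treat]  # values are distinct here
    deltas.append(sum(treat) / 3 - sum(control) / 3)

# One-sided p-value: fraction of splits at least as extreme as observed.
p_value = sum(d >= obs_delta for d in deltas) / len(deltas)
print(len(deltas), p_value)  # 20 0.05
```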

3.5

What is the assumption of the permutation test?

Exchangeability: under \(H_0\), the group labels are arbitrary, so every relabeling of the observations is equally likely; the observed assignment is just one of these equally likely relabelings.

3.6

What are the advantages of the permutation test?

It is distribution-free: no assumption is made about the underlying distribution of the data, only exchangeability under \(H_0\).

Question 4

Suppose we have a set of independent and identically distributed (i.i.d.) observations \(X_1, X_2, \dots, X_n\) from a Normal distribution with known mean \(\mu\) and unknown variance \(\sigma^2\):

\[ X_i \sim N(\mu, \sigma^2) \]

Please show that MLE for the variance \(\sigma^2\) is: \[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \]

(Note: For this exercise, assume \(\mu\) is a known constant. If \(\mu\) were unknown, we would replace it with the sample mean \(\bar{X}\).)

The likelihood is: \[ \begin{aligned} L(\sigma^2) =& \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right) \\ =& \left( 2\pi\sigma^2 \right)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \right) \end{aligned} \]

To simplify the differentiation, we calculate the log-likelihood: \[ \ell(\sigma^2) = \ln L(\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \]

Take the derivative with respect to our parameter of interest, \(\sigma^2\), and set it to zero: \[ \frac{d}{d(\sigma^2)} \ell(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (X_i - \mu)^2 = 0 \]

Multiply by \(2(\sigma^2)^2\): \[ \begin{aligned} -n\sigma^2 + \sum_{i=1}^n (X_i - \mu)^2 = 0 \\ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \end{aligned} \]
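A numeric check (the data are simulated; \(\mu\) and the true variance are arbitrary choices): the closed-form estimate matches a grid maximization of \(\ell(\sigma^2)\):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, n = 5.0, 500
x = rng.normal(mu, 2.0, size=n)  # true sigma^2 = 4.0

# Closed-form MLE with known mu.
sigma2_hat = np.mean((x - mu) ** 2)

# Maximize the log-likelihood over a grid of candidate variances.
grid = np.linspace(0.5, 10, 2000)
ll = -n / 2 * np.log(2 * np.pi * grid) - np.sum((x - mu) ** 2) / (2 * grid)
sigma2_grid = grid[np.argmax(ll)]

print(abs(sigma2_hat - sigma2_grid) < 0.01)  # True, up to grid spacing
```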

Question 5

If you recall from the previous course, a more common approach for contingency tables is Pearson’s \(\chi^2\) test, but in this question we will do it in a likelihood-ratio manner. We are investigating a new treatment and its effect on curing cancer. We get the following contingency table:

|               | Cured             | Not cured         | Marginal          |
|---------------|-------------------|-------------------|-------------------|
| Treatment     | \(O_{11}\)        | \(O_{12}\)        | \(O_{11}+O_{12}\) |
| Not Treatment | \(O_{21}\)        | \(O_{22}\)        | \(O_{21}+O_{22}\) |
| Marginal      | \(O_{11}+O_{21}\) | \(O_{12}+O_{22}\) | \(N\)             |

5.1

A contingency table with \(n\) total observations follows a Multinomial Distribution. The likelihood function for observing counts \(O_{ij}\) with cell probabilities \(p_{ij}\) is: \[ L = \frac{n!}{\prod O_{ij}!} \prod p_{ij}^{O_{ij}} \]

If the treatment and outcome are not independent, then the MLE of \(p_{ij}\) is \(\frac{O_{ij}}{N}\).

If the treatment and outcome are independent, then we restrict \(p_{ij} = \frac{\text{Row total}}{N}\cdot\frac{\text{Column total}}{N}\), so the expected count is \(E_{ij} = p_{ij}\times N = \frac{\text{Row total}\times\text{Column total}}{N}\).

Please show that the LRT test statistic is

\[ -2 \ln \lambda =-2 \sum O_{ij} \ln\left( \frac{E_{ij}}{O_{ij}} \right) \]

\[ \begin{aligned} -2\ln(\lambda) =& -2\ln\Bigg( \frac{\prod \left( \frac{E_{ij}}{n} \right)^{O_{ij}}}{\prod \left( \frac{O_{ij}}{n} \right)^{O_{ij}}}\Bigg) \\ =& -2 \Big(\sum O_{ij} \ln\left( \frac{E_{ij}}{n} \right) - \sum O_{ij} \ln\left( \frac{O_{ij}}{n} \right)\Big) \\ =& -2\sum O_{ij} \ln\left( \frac{E_{ij}/n}{O_{ij}/n} \right) \\ =& -2\sum O_{ij} \ln\left( \frac{E_{ij}}{O_{ij}} \right) \end{aligned} \]
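With a made-up \(2\times2\) table (the counts below are illustrative only), the statistic is straightforward to compute:

```python
import numpy as np

O = np.array([[30.0, 10.0],
              [20.0, 40.0]])  # hypothetical observed counts
N = O.sum()

# Expected counts under independence: row total * column total / N.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N

# LRT statistic -2 ln(lambda) = -2 * sum(O_ij * ln(E_ij / O_ij)).
G = -2 * np.sum(O * np.log(E / O))
print(round(G, 3))  # compared against a chi-square with 1 df for a 2x2 table
```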

5.2

In the \(2 \times 2\) contingency table, please show that Pearson’s \(\chi^2\) test statistic (\(X^2 = \sum \frac{(O_i - E_i)^2}{E_i}\)) is an approximation to the LRT obtained from a second-order Taylor expansion

Let \(\delta_i = O_i - E_i\) be the deviation of the observed count from the expected count.

Note that \(\sum \delta_i = 0\) (i.e. \(\sum O_i = \sum E_i\)) because the sum of observed counts must equal the sum of expected counts.

Rewrite the formula in 5.1 using \(\frac{O_i}{E_i} = \frac{E_i + \delta_i}{E_i} = 1 + \frac{\delta_i}{E_i}\); then

\[ G = 2 \sum O_i \ln\left(1 + \frac{\delta_i}{E_i}\right) \]

The Taylor series expansion for \(\ln(1+x)\) around \(x=0\) is:

\[ \ln(1+x) \approx x - \frac{x^2}{2} + \frac{x^3}{3} - \dots \]

Applying this to our term \(\ln\left(1 + \frac{\delta_i}{E_i}\right)\), where \(x = \frac{\delta_i}{E_i}\)

\[ \ln\left(1 + \frac{\delta_i}{E_i}\right) \approx \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2E_i^2} \]

Substitute this approximation back into the \(G\) formula:

\[ \begin{aligned} G \approx& 2 \sum O_i \left( \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2E_i^2} \right) \\ = & 2 \sum (E_i + \delta_i) \left( \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2E_i^2} \right) \\ = & 2 \sum \left( \delta_i - \frac{\delta_i^2}{2E_i} + \frac{\delta_i^2}{E_i} - \frac{\delta_i^3}{2E_i^2} \right) \end{aligned} \]

Ignore the higher-order term (\(\delta_i^3\)) as it becomes negligible when \(O_i\) is close to \(E_i\).

Simplify the remaining terms

\[ \begin{aligned} G \approx& 2 \left( \sum \delta_i + \sum \frac{\delta_i^2}{2E_i} \right) \\ =& 2 \left( 0 + \frac{1}{2} \sum \frac{\delta_i^2}{E_i} \right) \\ =& \sum \frac{\delta_i^2}{E_i}\\ =& \sum \frac{(O_i - E_i)^2}{E_i} \end{aligned} \]
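The approximation can be seen numerically on a table whose counts sit close to their expectations (the counts below are made up for illustration):

```python
import numpy as np

O = np.array([[52.0, 48.0],
              [45.0, 55.0]])  # hypothetical counts, close to independence
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

G = 2 * np.sum(O * np.log(O / E))  # LRT statistic
X2 = np.sum((O - E) ** 2 / E)      # Pearson's chi-square statistic

print(abs(G - X2) < 0.01)  # True: the two statistics nearly coincide
```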

5.3

We have discussed Pearson’s \(\chi^2\) test, LRT and Fisher exact test. Please comment about when to use which test.

  1. In large samples, Pearson’s \(\chi^2\) test and the LRT give similar results due to their shared asymptotic properties
  2. In relatively smaller samples, the LRT is the better choice, since Pearson’s \(\chi^2\) test is only an approximation of the LRT
  3. For really small samples (any \(O_{ij}\leq 5\)), the asymptotic property does not hold even for the LRT, so Fisher’s exact test is recommended

On the other hand, some other considerations:

  1. Fisher’s exact test is impractical for large sample sizes given the computational difficulty (imagine calculating \(\binom{120}{37}\))
  2. The LRT, again, is the more generalizable framework, in that no additional p-value adjustment is needed
  3. Pearson’s \(\chi^2\) test is the most popular given its closed-form solution (and it is relatively easy to calculate)